Abstract

Successful applications of reinforcement learning in real-world problems often 
require dealing with partially observable states. It is in general very challenging 
to construct and infer hidden states as they often depend on the agent’s entire 
interaction history and may require substantial domain knowledge. In this work, 
we investigate a deep-learning approach to learning the representation of states in
partially observable tasks, with minimal prior knowledge of the domain. In particular,
we propose a new family of hybrid models that combines the strength of
both supervised learning (SL) and reinforcement learning (RL), trained in a joint
fashion: the SL component can be a recurrent neural network (RNN) or its long
short-term memory (LSTM) version, which is equipped with the desired property
of being able to capture long-term dependency on history, thus providing an effective
way of learning the representation of hidden states. The RL component
is a deep Q-network (DQN) that learns to optimize the control for maximizing 
long-term rewards. Extensive experiments in a direct mailing campaign problem 
demonstrate the effectiveness and advantages of the proposed approach, which 
performs the best among a set of previous state-of-the-art methods. 


1 Introduction 


Consider customer relationship management (CRM) of a firm that interacts with users over time. At 
each decision point, the firm takes an action on its users, such as sending a catalog, a coupon or a 
greeting card. In response, a user may visit the store, place an order, or simply ignore the action. 
The goal of the firm is to take optimal actions to maximize total profits from users. In marketing, 
it is well established that actions taken by the firm can have a long-term effect on user response in 
the future, implying that myopic optimization of profit is usually sub-optimal. Instead, the life-time 
value (LTV) of users is a more desired metric of interest (Dwyer, 1997). With LTV as the objective,
CRM can be naturally formulated as a reinforcement-learning (RL) problem (Sutton & Barto, 1998)
where the immediate profit is used as a reward and LTV as a long-term value function. A similar
motivation was used in a recent application of RL to advertising (Theocharous et al., 2015).


Like many other real-world problems, e.g., robotics and human-computer interaction applications, 
CRM is challenging partly because of the partial observability of a user’s (Markovian) state. 
Roughly speaking, a user’s state summarizes her entire interaction history with the firm: conditioned
on the state and future actions, the future response of the user is independent of the interaction
history. In practice, constructing and measuring such a state is difficult in complex problems like 
CRM. Popular choices such as the Recency-Frequency-Monetary value model (details of which are 
given in experiments) arguably capture only partial information of a real user state. The problem of 
state inference therefore becomes critical when applying RL to these non-Markovian problems. 


The most common approach to dealing with partially observable states in reinforcement learning is 
to use a partially observable Markov decision process, or POMDP (Kaelbling et al., 1998), which
has been found successful in a few domains (Pineau et al., 2003; Williams & Young, 2007). However,
defining hidden states in a POMDP requires substantial domain knowledge, while such knowledge 
is not always available (or hard to obtain) for many complex, real-world tasks. 

In this work, inspired by the recent success of deep reinforcement learning (Mnih et al., 2015), we
investigate the use of deep neural networks to capture and infer hidden states in an automatic way. 
As opposed to POMDP-based approaches, deep learning holds the promise of automatically finding 
appropriate representations for a given problem, which can be difficult for a human expert, thus 
avoiding the laborious and challenging step of designing hidden states; see Deng & Yu (2014) for
an extensive survey of successful applications. In this paper, we propose a new, hybrid approach to
using deep learning to tackle complex tasks. Our approach differs from previous ones in two aspects:


• First, unlike Mnih et al. (2015), we employ recurrent neural network (RNN) and long
short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) models to learn the representation
of states for RL. Since these recurrent models can aggregate partial information
in the past, and can capture long-term dependencies in the sequential information, their
performance is expected to be superior to the contextual-window-based approach, which was
used in the DQN model of Mnih et al. (2015).


• Second, in order to best leverage supervision signals in the training data, the proposed hybrid
approach combines the strength of both supervised learning and RL. In particular, the
model in our hybrid approach is jointly learned using stochastic gradient descent (SGD):
in each iteration, the representation of hidden states is first inferred using supervision signals
(i.e., next observation and reward) in the training data; then, the Q-function is updated
using the DQN that takes the learned hidden states as input. The superiority of the hybrid
approach is validated in extensive experiments on a benchmark dataset.


In the rest of the paper, we will first review background information and related work. Then, we 
describe in detail our new, hybrid approach. The proposed approach is compared with previous 
methods in a public CRM benchmark in a series of experiments. Finally, the paper concludes with a 
discussion of future directions. 


2 Background and Related Work 
2.1 Reinforcement Learning 


In reinforcement learning, an agent uses observations and rewards to learn a (near-)optimal policy
for an environment that maximizes the expected total reward. Formally, in discrete steps
$t = 1, 2, 3, \ldots$, the agent receives an observation $o_t \in \mathcal{O}$, takes an action $a_t \in \mathcal{A}$, and receives
a real-valued reward $r_t$, where $\mathcal{O}$ and $\mathcal{A}$ are the sets of observations and actions, respectively. Let
$h_t = (o_1, a_1, r_1, \ldots, o_{t-1}, a_{t-1}, r_{t-1}, o_t)$ be the interaction history up to step $t$. The agent may select
actions according to a policy $\pi(h_t)$ at step $t$. The goal of RL is to find $\pi$ to maximize the expected
discounted cumulative reward, $R = \sum_{t} \gamma^{t-1} r_t$, for a given discount factor $\gamma \in (0, 1)$.
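
As a small illustration of the discounted cumulative reward $\sum_t \gamma^{t-1} r_t$ for a finite reward sequence, the following sketch may help. The function name `discounted_return` and the example rewards are hypothetical and are not taken from the experiments.

```python
def discounted_return(rewards, gamma):
    """Return sum over t >= 1 of gamma**(t-1) * r_t for a finite reward list."""
    total = 0.0
    # Accumulate from the last reward backwards (Horner's rule):
    # R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# A donation that arrives two steps later is still valued, but discounted twice.
delayed = discounted_return([0.0, 0.0, 10.0], gamma=0.9)
immediate = discounted_return([10.0, 0.0, 0.0], gamma=0.9)
```

With $\gamma < 1$, `immediate` is larger than `delayed`, which is the usual trade-off between short-term and long-term reward.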


In the case of MDPs where observations are states, $o_t$ is often denoted as $s_t$. The Q-function, $Q^*(s, a)$,
is the expected discounted cumulative reward obtained by taking action $a$ in state $s$ and then following
an optimal policy thereafter. The celebrated Q-learning algorithm and its variants (Sutton & Barto, 1998)
can be used to learn the Q-function from data, by repeated applications of a stochastic
approximation update rule on observed transitions $(s, a, r, s')$:

$$Q(s, a) \leftarrow Q(s, a) + \eta \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right),$$

where $\eta$ is a step-size parameter. Once $Q \approx Q^*$, the greedy policy, $\pi_Q(s) := \arg\max_a Q(s, a)$, is
near-optimal.
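
The Q-learning update and the greedy policy can be sketched for the tabular case as follows. This is a minimal illustration, assuming discrete, hashable states and actions; the names `q_learning_update` and `greedy_action`, and the toy states and actions, are hypothetical.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, eta=0.1, gamma=0.9):
    """Apply Q(s,a) <- Q(s,a) + eta * (r + gamma * max_a' Q(s',a') - Q(s,a)) in place.

    Returns the temporal-difference error used in the update.
    """
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += eta * td_error
    return td_error


def greedy_action(Q, s, actions):
    """Greedy policy pi_Q(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])


# Toy usage with made-up states, actions, and rewards.
Q = defaultdict(float)          # unseen (state, action) pairs start at 0
actions = ["mail", "no_mail"]
transitions = [("lapsed", "mail", 5.0, "active"),
               ("active", "no_mail", 0.0, "lapsed")]
for _ in range(200):
    for s, a, r, s_next in transitions:
        q_learning_update(Q, s, a, r, s_next, actions)
```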


In non-Markovian problems like POMDPs (Kaelbling et al., 1998), $o_t$ provides partial information
about the unobserved state $s_t$, and can be used to sequentially update the belief state. Although
POMDPs have a solid theoretical foundation, their application often requires substantial domain
knowledge to define the set of hidden states and observation probabilities. In this paper, we use the
rich family of RNN/LSTM models to represent and learn hidden states in CRM-like tasks, motivated
by their excellent capabilities of representation learning without much human intervention. Finally,
another promising approach to modeling non-Markovian problems is to use a predictive state representation,
or PSR (Littman et al., 2002). While PSR has great representational power and may
be easier to learn from data than POMDPs, like POMDPs, applying PSR often requires substantial
domain knowledge to design features or a kernel function (Boots et al., 2011).

2.2 Deep (Reinforcement) Learning

Recently, deep learning has seen exciting successes in solving reinforcement-learning problems. 
Most prominent is the recent use of a deep Q-network (DQN) in Q-learning to solve a large number
of Atari games (Mnih et al., 2015), although neural networks have been used in some of the classic
RL applications like TD-Gammon (Tesauro, 1995).


For partially observable environments, deep learning may also be used to represent and track hidden 
states, without much domain knowledge. For example, Deng et al. (2013) apply a deep network
to track belief states in a spoken dialogue system. Examples in RL include earlier applications of
recurrent neural networks to control problems (Bakker, 2002; Lin, 1993), and more recently to Atari
games (Hausknecht & Stone, 2015) and text games (Narasimhan et al., 2015). In these works, an
RNN or LSTM is used to represent a Q-function, $Q(s, a; \theta)$, parameterized by $\theta$. We call these
models RL-RNN and RL-LSTM, respectively. A variant of Q-learning updates parameters on the
observed transitions $(s, a, r, s')$ by:

$$\theta \leftarrow \theta + \eta \left( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right) \nabla_\theta Q(s, a; \theta).$$
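
To make the parameter update concrete, the sketch below applies it to a linear Q-function, $Q(s, a; \theta) = \theta^\top \phi(s, a)$, for which $\nabla_\theta Q(s, a; \theta) = \phi(s, a)$. The linear form is swapped in here only for brevity and is an assumption; the works cited above use an RNN or LSTM instead. The feature vectors and function names are hypothetical.

```python
def q_value(theta, features):
    """Linear Q-function: dot product of parameters and features phi(s, a)."""
    return sum(w * x for w, x in zip(theta, features))


def semi_gradient_update(theta, phi_sa, r, phi_next_all, eta=0.01, gamma=0.9):
    """One step of theta <- theta + eta * td_error * grad_theta Q(s, a; theta).

    phi_sa       : features of the taken (s, a) pair
    phi_next_all : list of feature vectors phi(s', a'), one per action a'
    """
    target = r + gamma * max(q_value(theta, phi) for phi in phi_next_all)
    td_error = target - q_value(theta, phi_sa)
    # For a linear Q-function the gradient with respect to theta is phi(s, a).
    return [w + eta * td_error * x for w, x in zip(theta, phi_sa)]


theta = [0.0, 0.0]
theta = semi_gradient_update(theta, phi_sa=[1.0, 2.0], r=1.0,
                             phi_next_all=[[1.0, 0.0], [0.0, 1.0]], eta=0.5)
```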

In contrast to previous work, we propose a new, hybrid model which combines supervised learning
and reinforcement learning. During training, we use the supervised signals to learn the state
representation, then jointly train DQN to approximate the Q-function.


2.3 Customer Relationship Management 


In general, CRM refers to data-driven approaches to determining corporate practices in order to 
maximize lifetime value of customers (Kumar & Reinartz, 2012). Central to CRM is the notion
of lifetime value (Dwyer, 1997), as opposed to short-term measures of customer value. In the past,
(un)supervised learning has been applied to CRM on problems like customer segmentation, although
the focus is on obtaining useful insights to support business decision making (Berry & Linoff, 2004,
page 7). Our work, in contrast, tries to close the decision-making loop: we aim to develop machine- 
learned models that directly suggest actions to maximize LTV of customers. 


We thus take an RL approach to learn a decision-making policy from data. Pednault et al. (2002)
consider cost-effective decision making in CRM, using batch Q-learning to learn a piecewise linear
Q-function. Later, Silver et al. (2013) apply variants of Q-learning to learn a linear Q-function in the
CRM task of email campaigns. In contrast, we use the much more flexible function approximator 
of neural networks to learn the Q-function, which substantially outperforms a strong baseline that 
uses linear Q-functions in experiments. More importantly, it provides an effective way to deal with 
hidden states that are not considered by these authors. 


Closest to this work is a recent application of DQN to CRM (Tkachenko, 2015), which uses the same
benchmark data. Our work differs in a number of substantial ways. First, we focus on the challenge 
of non-Markovian CRM problems, overcoming the suboptimality associated with his use of DQNs. 
Second, we employ state-of-the-art deep learning models to capture hidden states, and develop a 
novel, hybrid model combining the strengths of supervised and reinforcement learning. Finally, we 
adopt a different evaluation methodology that is more appropriate for RL tasks, while the one in 
Tkachenko (2015) is fundamentally flawed. Details of these distinctions will be clearer later in the
paper. 


3 Model 

Recall that our goal is to learn the optimal Q-function from a sequence of (or sequences of)
interaction histories in the form of $(o_1, a_1, r_1, o_2, a_2, r_2, \ldots)$. A common approach is to optimize
a recurrent network to approximate the Q-function, given such networks’ strong ability to capture 
long-term dependency. Once a good approximation is obtained, a near-optimal policy can be readily 
defined that selects actions greedily. Such an approach, however, mingles policy learning and long-term
dependence learning during network optimization, which makes it challenging to find and train
a single network for both purposes simultaneously. 

Motivated by this challenge, we propose to use a hybrid model with two networks, and more importantly,
a novel joint optimization procedure for this model. The model is designed based on the
following observations. Reinforcement learning (RL) models, as described in Appendix A.2, can be
trained to maximize long-term rewards. In contrast, supervised learning (SL) models, as described in
Appendix A.1, can be optimized to predict observations and immediate rewards, thus having the
potential to better represent and infer hidden states. With such complementary strengths, it is beneficial
to take a hybrid approach, which uses SL for hidden-state representation learning and RL for policy
learning. Moreover, these two components should not be optimized separately: ideally, the SL component
should learn an internal state representation that allows the RL component to maximize long-term reward.
Decoupling the training of the two networks likely results in a worse learned policy.


Specifically, we propose a new family of hybrid models combining supervised learning and reinforcement
learning, trained in a joint fashion: the SL component can be, for example, an RNN or
LSTM; the RL component is a DQN. The resulting models are called SL-RNN + RL-DQN (Figure 1)
and SL-LSTM + RL-DQN, respectively.



Figure 1: Supervised RNN + Reinforced DQN: $o_t$ is the observation, $h_t$ is the hidden state of the RNN,
the SL outputs are the predicted observation for time $t + 1$ and the predicted reward $R_t$, and $Q(s, a)_t$ is
the predicted Q-value at time $t$. The blue parts correspond to an unfolded RNN for SL, and the red parts to the DQN.
In this hybrid model, the input of the DQN is the hidden layers of the supervised RNN model.


For training the hybrid model, we use a joint supervised-reinforced approach. First, we train an RNN 
(or LSTM), which learns hidden states from signals including next observations and immediate 
rewards. Then, the learned hidden states are the input to the DQN, which learns the Q-function of a
near-optimal policy. These two training steps are interleaved in each SGD iteration.


The difference between these models and RL-RNN/RL-LSTM (Bakker, 2002; Hausknecht & Stone,
2015; Lin, 1993; Narasimhan et al., 2015) is that, during training, the supervised signals are used
to learn the state information, and are back-propagated to the head of the RNN/LSTM, while the RL
signals are only back-propagated to the hidden layers of the RNN for DQN training, and are not involved
in the RNN training.
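
The gradient flow described above can be illustrated with a deliberately tiny sketch. It assumes a scalar recurrent cell, a reward-only supervised target (the model in the paper also predicts the next observation), one-step truncated back-propagation, and a linear Q head in place of a deep Q-network. All names and hyperparameters are hypothetical. The point is only that the supervised step changes the recurrent weights, while the RL step treats the hidden state as a constant input.

```python
import math


def rnn_step(rnn, h_prev, o):
    """Scalar recurrent cell: h_t = tanh(w_h * h_{t-1} + w_o * o_t)."""
    return math.tanh(rnn["w_h"] * h_prev + rnn["w_o"] * o)


def sl_update(rnn, h_prev, o, r, lr=0.05):
    """Supervised step: fit the reward prediction v * h_t to the observed r.

    The squared-error gradient reaches the recurrent weights
    (truncated to one step back for brevity). Returns h_t.
    """
    h = rnn_step(rnn, h_prev, o)
    err = rnn["v"] * h - r                    # d(0.5 * err**2) / d(prediction)
    dh = err * rnn["v"] * (1.0 - h * h)       # back-propagate through tanh
    rnn["v"] -= lr * err * h
    rnn["w_h"] -= lr * dh * h_prev
    rnn["w_o"] -= lr * dh * o
    return h


def rl_update(dqn, h, a, r, h_next, lr=0.05, gamma=0.9):
    """Q-learning step on a linear head Q(h, a) = q[a] * h + b[a].

    h and h_next are plain floats, so no gradient reaches the RNN weights.
    """
    n_actions = len(dqn["q"])
    target = r + gamma * max(dqn["q"][k] * h_next + dqn["b"][k]
                             for k in range(n_actions))
    td = target - (dqn["q"][a] * h + dqn["b"][a])
    dqn["q"][a] += lr * td * h
    dqn["b"][a] += lr * td


def train_joint(episode, rnn, dqn, epochs=50):
    """Interleave the SL and RL steps on one episode of (o_t, a_t, r_t) tuples.

    The last tuple is used only for its observation.
    """
    for _ in range(epochs):
        h_prev = 0.0
        for t in range(len(episode) - 1):
            o, a, r = episode[t]
            o_next = episode[t + 1][0]
            h = sl_update(rnn, h_prev, o, r)       # SL signal trains the RNN
            h_next = rnn_step(rnn, h, o_next)
            rl_update(dqn, h, a, r, h_next)        # RL signal trains only the Q head
            h_prev = h
```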


4 Experiments 


4.1 DataSet 

In this paper, we use the 1998 KDD Cup direct mailing dataset,¹ which has been used in the RL
literature (Marivate, 2015; Pednault et al., 2002) for various purposes. It was collected by a non-profit
organization, PVA, which provides programs and services for US veterans with spinal cord
injuries or disease. PVA raises money via direct mailing campaigns. The dataset contains a record 
for every donor who received the mailing and did not make a donation in the 12 months before that. 
For each of them, it is recorded whether and how much they donated as a response to the campaigns. 
Apart from that, data is given about the previous and current mailing campaign, as well as personal 
information and the giving history of each lapsed donor. The training data is collected for 23 distinct 
periods for 95,412 donors, resulting in over 2M transition tuples. Each donor’s interaction history 
can be viewed as a time series of 23 steps. 


¹ https://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html

Direct mailing campaigns are a typical CRM task, where the decision is on what type of
email to send, in order to maximize long-term profit (or cumulative donation in the case of
PVA). In the dataset, we found 12 actions, including 11 mailing types and 1 inaction (corresponding
to non-response in the dataset). The resulting data for each client is a sequence
$(o_1, a_1, r_1, \ldots, o_{22}, a_{22}, r_{22}, o_{23})$, where:






• $o_t$ is the current observation of the client. As in previous work (Tkachenko, 2015), it is a
5-dimensional vector consisting of (1) how recently the donor donated last, (2) how frequently
she donates, (3) her average donation amount, (4) how many times PVA sent her
a mail in the last six months, and (5) how many times PVA has sent her mails. The first
three correspond to the well-known Recency-Frequency-Monetary value model in CRM,
and the other two are application specific.

• $a_t$ is one of the 12 actions taken by PVA.

• $r_t$ is the immediate reward received after taking action $a_t$. In the raw dataset, the reward is
the amount of donation in dollars, ranging from \$0 to \$1000.


4.2 Evaluation Methodology 


Reliable evaluation has been a challenge in reinforcement learning when no simulator (like those 
used in Atari games) is available. The approach of Tkachenko (2015) proceeds as follows. After
a model is optimized using training data, it is run on test data to select actions in every step. The 
test data is then partitioned into two subsets: the SAME set consists of transitions where the model’s 
selected action is the same as the action in the data; all other transitions are in the DEVIATED set. 
Clearly, the partition into SAME and DEVIATED is model dependent. Finally, a model is considered 
better if its average reward in the corresponding SAME set is higher. 


While the procedure above might sound intuitive and appealing, it is fundamentally flawed for evaluating
action-selection models. First, it focuses on myopic rewards, and thus fails to reflect the capability
of RL models whose very aim is to optimize the (long-term) LTV of customers. The second problem is
more subtle but equally severe: that evaluation is really about finding correlation rather than causation,
which can lead to many paradoxes. For instance, imagine a model that learns how PVA selected
actions when collecting this dataset. It can then select actions based on whether the client is generous
(using information encoded in observations): it selects the same action as PVA if and only if the client
is generous. This way, the model is able to “game” the evaluation protocol of Tkachenko (2015) by
forcing its SAME set to contain only generous clients who tend to donate more. However, such a
cherry-picking model is not expected to do anything better than the data-collection policy of PVA.
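
The cherry-picking paradox can be reproduced on a toy log. The sketch below is illustrative only: the `generous` flag, the action names, and the reward values are made up, and the logging policy is assumed to mail every client.

```python
def same_set_average(model, logged):
    """Average logged reward over the SAME set: transitions where the
    model's chosen action equals the logged action."""
    same = [r for obs, a, r in logged if model(obs) == a]
    return sum(same) / len(same) if same else 0.0


# Hypothetical log: 2 generous clients donate 20, 8 others donate 1.
logged = ([({"generous": True}, "mail", 20.0)] * 2
          + [({"generous": False}, "mail", 1.0)] * 8)


def copy_logging_policy(obs):
    """Behaves exactly like the data-collection policy."""
    return "mail"


def cherry_picker(obs):
    """Agrees with the logged action only for generous clients."""
    return "mail" if obs["generous"] else "skip"


copy_score = same_set_average(copy_logging_policy, logged)
cherry_score = same_set_average(cherry_picker, logged)
```

Here `cherry_score` is far higher than `copy_score`, even though the cherry-picking model never causes any client to donate more than under the logging policy.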


Given these important drawbacks, we adopt a different evaluation that is common in the RL literature:
we use the dataset to build a simulator, and rely on the simulator to generate synthetic CRM
interaction sequences for training and evaluating different action-selection models. While building
a model is in general nontrivial, this data presents a factored structure: at any step $t$, given $a_t$ and
$r_t$, the five components in the observation vectors evolve (from $o_t$ to $o_{t+1}$) independently. A similar
approach was also taken in previous work (Pednault et al., 2002). More specifically, at step $t$, the
simulator takes the observed history $h_t$ and action $a_t$ as input to predict:


• next observation $o_{t+1}$: the 5-dimensional observation is discrete, and individual dimensions
evolve independently of each other. We therefore build an observation probability table for
each observation dimension, and then sample next observations using these tables;

• reward $r_t$: in the experiments, we build the reward function using an RNN, trained to predict
reward $r$ using its internal history summary $h_t$. This simulator thus creates a realistic scenario
that allows hidden states and long-term effects on customers (Netzer et al., 2008).
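
The observation part of such a simulator can be sketched with one count table per observation dimension. The following is a minimal illustration under several assumptions: episodes are lists of `(o, a, r)` tuples, rewards are already discretized so they can serve as table keys, and an unseen context leaves the value unchanged. None of these details are specified in the text, and the RNN reward model is not shown.

```python
import random
from collections import Counter, defaultdict


def build_tables(episodes, n_dims=5):
    """tables[d][(v, a, r)] counts the next values v' of dimension d
    observed after value v, action a, and reward r."""
    tables = [defaultdict(Counter) for _ in range(n_dims)]
    for episode in episodes:
        for (o, a, r), (o_next, _, _) in zip(episode, episode[1:]):
            for d in range(n_dims):
                tables[d][(o[d], a, r)][o_next[d]] += 1
    return tables


def sample_next_observation(tables, o, a, r, rng):
    """Sample each dimension of o_{t+1} independently from its own table."""
    nxt = []
    for d, table in enumerate(tables):
        counts = table.get((o[d], a, r))
        if not counts:
            nxt.append(o[d])   # assumption: unseen context keeps the old value
            continue
        values = list(counts)
        weights = [counts[v] for v in values]
        nxt.append(rng.choices(values, weights=weights, k=1)[0])
    return tuple(nxt)


# Toy episode with made-up discrete observations; the final step has no action.
episodes = [[((0, 0, 0, 0, 0), 1, 0),
             ((1, 0, 0, 0, 1), 1, 0),
             ((2, 0, 0, 0, 2), None, None)]]
tables = build_tables(episodes)
rng = random.Random(0)
o_next = sample_next_observation(tables, (0, 0, 0, 0, 0), 1, 0, rng)
```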


4.3 Experiment Setup 

We found that smaller data are enough to yield strong policies, and therefore use only a random subset of
donors from the entire data for experiments. We tried four data sizes with varying numbers of donors, each
having 23 steps, so that the total number of transitions is {50K, 100K, 200K, 500K}. The data were
then split into training, validation, and test sets with proportions 4:1:1. 

To generate training data, we started with the initial observation vector of donors in the training set, 
and ran one of the following data-collection policies to select actions: 

Table 1: Summary of Evaluation Settings

Behavior Policy | Data Size
U, M, R         | 50K, 100K, 200K, 500K


Table 2: Evaluation Setting Configuration

Experiment | Behavior Policy | Data Size
E1         | M               | 100K
E2         | {U, M, R}       | 100K
E3         | M               | {50K, 100K, 200K, 500K}

• Uniformly random (U): at any step, an action is chosen uniformly at random from the set 
of actions. 

• Probability-matching (M): at any step, an action is chosen with probability proportional to 
its frequency in raw data. 

• Real (R): the chosen action is the actual one recorded in raw data. The simulator is used 
only to regenerate reward. 
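
The three data-collection policies can be sketched as follows. This is a minimal illustration; the function names are hypothetical, and actions are assumed to be represented as integers.

```python
import random
from collections import Counter


def uniform_policy(actions, rng):
    """U: choose an action uniformly at random."""
    return rng.choice(actions)


def make_matching_policy(logged_actions, rng):
    """M: choose each action with probability proportional to its
    frequency in the raw data."""
    counts = Counter(logged_actions)
    values = sorted(counts)
    weights = [counts[a] for a in values]

    def policy():
        return rng.choices(values, weights=weights, k=1)[0]

    return policy


def real_policy(logged_episode_actions, t):
    """R: replay the action recorded in the raw data at step t."""
    return logged_episode_actions[t]
```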

Table 1 summarizes all choices in our experiment setup. We fixed $\gamma = 0.9$
and compared the average per-step reward (donation) of the learned policies. We ran each setting at 
least 5 times and report the average. 

The following models are used as baselines in our experiments: 

• One approach is to treat the problem as supervised learning (SL), where we predict whether
the short-term reward $r_t$ is larger than a threshold $\tau$. The input can be the current observation
$o_t$ or the sequence of observations $(o_1, \ldots, o_t)$, leading to a multi-layer neural network
(DNN) and RNN/LSTM, respectively. In our experiments, we found $\tau = 0$ to work
best empirically. At test time, the model, denoted $R$, takes the current observation or the observation
sequence as input, and selects actions greedily according to its reward predictions.

• The DQN (Mnih et al., 2015) with history window is a deep RL baseline. We select the best
model from candidate history window lengths {1,2, 3}, and report the best performance. 

• Two other deep RL baselines, RL-RNN and RL-LSTM, are similar to DQN, but are expected
to handle partial observability by explicitly modeling long-term dependencies of future
rewards on history (Bakker, 2002; Hausknecht & Stone, 2015; Lin, 1993; Narasimhan et al., 2015).

• Joint models with separate training: we use the same models described in Section 3, but
training is different. We first train the RNN/LSTM to minimize the squared prediction error of
observation/reward until the network converges. Then they are fixed and are used to generate
a hidden state representation that is the input to DQN. 
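
For the history-window DQN baseline in the list above, the network input at each step can be formed by concatenating the most recent observations. The sketch below shows one way to do this; the zero-padding at the start of an episode and the function name `windowed_inputs` are assumptions, not details given in the text.

```python
def windowed_inputs(observations, k):
    """For each step t, concatenate the last k observations (oldest first),
    padding with zeros when fewer than k observations are available."""
    dim = len(observations[0])
    padding = [[0.0] * dim for _ in range(k - 1)]
    padded = padding + [list(o) for o in observations]
    inputs = []
    for t in range(len(observations)):
        window = padded[t:t + k]
        inputs.append([x for obs in window for x in obs])
    return inputs


# With k = 1 the input is just the current observation.
example = windowed_inputs([[1.0], [2.0], [3.0]], k=2)
```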

More details of the baselines are described in Appendix A.

4.4 Results 

Three sets of experiments were done, as summarized in Table 2. Each experiment focuses on a
particular aspect and is discussed in detail in the following subsections. 

Experiment E1 This experiment is to investigate how hidden states affect relative performance of
various models. Recall that with the RNN simulator, rewards are a function of the current observation 
as well as history up to that step. In other words, the model must be able to infer and track such 
hidden states in order to maximize cumulative rewards. 

From Table 3 and Figure 2, we can see that DQN significantly outperforms DNN. RL-RNN and RL-LSTM
significantly outperform all SL models. Furthermore, there is a clear advantage of RL-RNN



























Under review as a conference paper at ICLR 2016 


Table 3: Supervised Learning and Reinforcement Learning under RNN Simulator. The superscripts
a, b, c and d indicate statistically significant improvements (p < 0.05) over DNN, SL models, {SL
models, DQN}, and {RL-RNN, RL-LSTM}, respectively. SL-RNN + RL-DQN* and SL-LSTM + RL-DQN*
use separate training; e indicates that joint training of a hybrid model significantly improves
over the corresponding model with separate training.

SL Models | Avg Reward ($) | RL Models         | Avg Reward ($)
DNN       | 8.10           | DQN               | 9.14
RNN       | 9.03           | RL-RNN            | 9.39
LSTM      | 9.01           | RL-LSTM           | 9.35
          |                | SL-RNN + RL-DQN   | 9.66
          |                | SL-LSTM + RL-DQN  | 9.61
          |                | SL-RNN + RL-DQN*  | 9.49
          |                | SL-LSTM + RL-DQN* | 9.37


Figure 2: Learning curves for RL models (x-axis: epoch).


and RL-LSTM over DQN, since the latter ignores history information. Third, “SL-RNN + RL-DQN”
is significantly better than RL-RNN, and similarly when RNN is replaced by LSTM.

Finally, Table 3 also shows that separate parameter training is inferior to our proposed joint training
approach. With separate training, it is difficult to know whether and when the learned hidden
state representation is good enough to enable DQN to learn a good Q-function. Our joint training 
approach, on the other hand, couples training of RNN/LSTM and of DQN. It thus can learn a state 
representation that facilitates Q-function learning in RL-DQN. 


Experiment E2 The second set of experiments is to investigate how the data-collection policy 
affects model performance. It is well-known that proper exploration is necessary to learn a good 
policy, and we examine how it affects our models empirically. From Figure 3, when actions were
selected by U and M, qualitatively similar results are obtained as before. However, when actions
were chosen by R, all models’ performance decays, and reinforcement learning models are hurt
so much more that they become inferior to supervised learning models.

An examination of the raw data revealed that the actual data-collection policy by PVA seemed to
be deterministic: the same action is applied to all donors at the same step. Therefore, little or no
exploration exists in this dataset. Reinforcement learning algorithms tend to be more sensitive to
lack of exploration than supervised learning algorithms, because the former in a sense try to predict
what will happen far into the future. The less exploration, the more error is introduced into the
projection of an RL algorithm, which is consistent with what is shown in Figure 3.



Figure 3: Supervised Learning and Reinforcement Learning under RNN Simulator with different 
simulation data (U, M, R). Each group has eight models: three SL models and five RL models. 


Experiment E3 In the last set of experiments, we varied data sizes to see how it affects each 
model’s performance. From Figure 4, we see similar results across a wide range of data sizes. These
preliminary results indicate our models are data-efficient, and the benefits over SL or RL models are 
consistent with different data sizes. 


Figure 4: Supervised Learning and Reinforcement Learning under RNN Simulator with different
data sizes (50K, 100K, 200K, and 500K); the x-axis is data size. Each group has eight models: three
SL models and five RL models.


5 Conclusions 

In this work, we propose a hybrid approach that uses recurrent deep learning models, and combines 
the strength of both supervised learning and reinforcement learning, to solve a CRM task, which is 
typical of real-world non-Markovian problems. In particular, our hybrid approach utilizes supervised 
signals in training data to learn hidden-state representations, and then jointly trains a DQN (using
reinforcement learning) to optimize the control for maximizing long-term rewards. Through a large- 
scale experimental analysis under different settings, we showed that the proposed hybrid models 
significantly outperform other state-of-the-art SL/RL models across the board: (1) Deep RL is more 
effective than SL for optimizing lifetime values; (2) RL with RNN/LSTM models is a promising 
approach to solving non-Markovian tasks with long-term dependencies; (3) It is promising to use 
memory networks models to learn hidden-state representations in a supervised learning manner, with 
the DQN jointly trained for non-Markovian tasks. Beyond the analysis, our experimental results 
demonstrate the promise of deep reinforcement learning for specific CRM tasks. 

The experimental results reported in this paper suggest multiple interesting directions for future
work. One is to explore the use of recurrent networks in model-based or policy-based RL, as opposed 
to the value-function-based approaches taken in this work. Another important direction is to capture 
latent structures of actions, in order to facilitate generalization across actions and to handle newly 
emerged actions common in a range of applications. 

A Baseline Models

This section gives additional details for the supervised-learning and deep RL baselines we used in 
the experiments. 

A.1 Supervised Learning

The first baseline is to treat the problem as SL, in which one tries to predict which action leads 
to higher expected (immediate) reward given the interaction history so far. In our experiments, we 
formulated the problem as regression with the raw reward signal as target. For each transition tuple 
(o, a, r, o') from training data, we tried to learn the regression of r given observation o and possibly 
the history that led to this transition, depending on which network model is used. This reduction 
results in standard deep learning models with mean squared error as the loss function for training. 

Several network architectures are used, with and without built-in modeling of long-term dependency: 

• A multi-layer (deep) neural network (DNN) breaks an interaction history into individual
transitions, $\{(o_t, a_t, r_t, o_{t+1})\}_{t=1,2,\ldots}$. The network is trained to predict $r_t$ based on
$(o_t, a_t)$, for $r_t > \tau$. In our experiments, we found $\tau = 0$ to work best empirically. At
test time, the model, denoted $R$, takes the current observation $o$ as input, and selects actions
greedily according to its reward predictions: $\arg\max_a R(o, a)$.

• RNN and LSTM can model long-term dependency in a customer’s interaction history. As
shown in Figure 5 for the case of RNN, the interaction history can no longer be decomposed
into separate transitions as with DNN. At step $t$, the model is updated using observation $o_t$,
reward $r_t$ and the current internal history summary, $h_{t-1}$, which is maintained recursively
in the RNN. At test time, the model selects actions in a similar fashion, based on both the
current observation and the current internal history summary. The case for LSTM is similar.


Figure 5: An unfolded supervised learning RNN: $o_t$ is the observation, $h_t$ is the hidden state of the
RNN, and $R(s, a)_t$ is the predicted reward at time $t$, where $s$ is the $h_t$ of the RNN.


A.2 Reinforcement Learning 


The SL models above only consider immediate rewards. In contrast, RL takes future rewards into account
and aims to optimize total long-term reward directly, which is desired in CRM tasks when one
tries to optimize the LTV of customers.


Our first deep RL baseline is DQN (Mnih et al., 2015), where we treat $o_t$ as state $s_t$, and optimize network
parameters to obtain an approximate Q-function, $Q(s, a)$. Once a good Q-network is learned,
it can be used to select actions in a greedy fashion: $\pi_Q(s) := \arg\max_a Q(s, a)$.


The second and third deep reinforcement learning baselines, RL-RNN and RL-LSTM, are similar
to DQN, but are expected to handle partial observability by explicitly modeling long-term dependencies
of future rewards on history (Bakker, 2002; Hausknecht & Stone, 2015; Lin, 1993;
Narasimhan et al., 2015). Similar to RNN and LSTM in supervised-learning models, the Q-network
now is a function of the current observation $o_t$ and the current internal history summary $h_{t-1}$. Again,
the internal history summary $h_t$ is updated recursively as time goes on. Actions are selected greedily
after the Q-network is learned.


Figure 6: Reinforcement learning with RNN (RL-RNN): $o_t$ is the observation, $h_t$ is the hidden state
of the RNN, $Q(s, a)_t$ is the predicted Q-value for action $a$ at time $t$, and $s$ is $h_t$.


In practice, training DQN or RL-RNN/RL-LSTM may be unstable, due to dependence of transition
tuples in an interaction history. One variant, as used in previous work (Mnih et al., 2015), is to
use two Q-networks: one network (the “target network”) is used to define the target value in Q-learning
updates, $r + \gamma \max_{a'} Q(s', a'; \theta)$, while the other is used for parameter updates. When the
latter network’s parameters converge, it becomes the target network, and the process repeats until
convergence. We use the same variant for DQN, RL-RNN and RL-LSTM in the experiments.
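
The two-network scheme can be sketched in tabular form, with dictionaries standing in for the two Q-networks. This is only an illustration of the control flow: the function name, the convergence tolerance, and the numbers of rounds and sweeps are hypothetical choices, not settings from the experiments.

```python
def lookup(table, s, a):
    """Q-value of (s, a), defaulting to 0 for unseen pairs."""
    return table.get((s, a), 0.0)


def train_with_target_network(transitions, actions, eta=0.5, gamma=0.9,
                              rounds=50, sweeps=200, tol=1e-6):
    """Tabular stand-in for the two-network variant.

    Targets are computed from a frozen copy (q_target); only q_online is
    updated. Once q_online stops changing, it replaces the target copy.
    """
    q_online, q_target = {}, {}
    for _ in range(rounds):
        for _ in range(sweeps):
            largest_change = 0.0
            for s, a, r, s_next in transitions:
                target = r + gamma * max(lookup(q_target, s_next, a2)
                                         for a2 in actions)
                old = lookup(q_online, s, a)
                new = old + eta * (target - old)
                q_online[(s, a)] = new
                largest_change = max(largest_change, abs(new - old))
            if largest_change < tol:
                break                      # online table has converged
        q_target = dict(q_online)          # it becomes the new target table
    return q_online


# One self-looping state with reward 1 and gamma = 0.5: Q* = 1 / (1 - 0.5) = 2.
q = train_with_target_network([("s", "a", 1.0, "s")], ["a"], gamma=0.5)
```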